EPJ manuscript No. 

(will be inserted by the editor) 



Extraction of physical laws from joint experimental data 




Igor Grabec 

Faculty of Mechanical Engineering, University of Ljubljana,
Askerceva 6, PP 394, 1001 Ljubljana, Slovenia,

Tel: +386 01 4771 605, Fax: +386 01 4253 135, 

E-mail: igor.grabec@fs.uni-lj.si

Received: date / Revised version: date 

Abstract. The extraction of a physical law y = y(x) from joint experimental data about x and y is 
treated. The joint, the marginal and the conditional probability density functions (PDF) are expressed by 
given data over an estimator whose kernel is the instrument scattering function. As an optimal estimator 
of y (x) the conditional average is proposed. The analysis of its properties is based upon a new definition 
of prediction quality. The joint experimental information and the redundancy of joint measurements are 
expressed by the relative entropy. With the number of experiments the redundancy on average increases, 
while the experimental information converges to a certain limit value. The difference between this limit 
value and the experimental information at a finite number of data represents the discrepancy between 
the experimentally determined and the true properties of the phenomenon. The sum of the discrepancy 
measure and the redundancy is utilized as a cost function. By its minimum a reasonable number of data 
for the extraction of the law y (x) is specified. The mutual information is defined by the marginal and 
the conditional PDFs of the variables. The ratio between mutual information and marginal information 
is used to indicate which variable is the independent one. The properties of the introduced statistics are 
demonstrated on deterministically and randomly related variables. 



PACS. 06.20.DK Measurement and error theory - 02.50.+S Probability theory, stochastic processes, and 
statistics - 89.70.+C Information science 

1 Introduction 

The progress of natural sciences depends on advancement 
in the fields of experimental techniques and modeling of 
relations between experimental data in terms of physical 
laws. [1,2] Computers have revolutionized the acquisition
of experimental data, while modeling still awaits
comparable progress. For this purpose the
modeling process should be generally described in terms 
of operations that could be autonomously performed by a 
computer. A step in this direction was taken recently by a 
nonparametric statistical modeling of the probability
distribution of measured data. [3] Nonparametric modeling
requires no a priori assumptions about the probability
density function (PDF) of measured data and therefore
provides for a fairly general and autonomous experimental
modeling of physical laws by a computer. [1,4] Moreover,
the inaccuracy of measurement caused by stochastic
influences can be properly accounted for in nonparametric
modeling, which further leads to the expression of
experimental information, redundancy of repeated
measurements and a model cost function in terms of the
entropy of information. These variables have already been
applied when formulating an optimal nonparametric modeling
of a PDF in the simplest case of a one-dimensional variable. [3]
However, more frequently than modeling of a PDF the 
problem is to extract a physical law from joint data about 
various variables and to analyze its properties. Therefore, 
the aim of this article is to propose a general statistical 
approach also to the solution of this problem. 




As an optimal statistical estimator of an experimental
physical law we propose the conditional average (CA)
that is determined by the conditional PDF. [1] This estimator
represents a nonparametric regression whose structure
is case independent; hence it can be generally programmed
and autonomously determined by a computer.
Due to these convenient properties, we consider CA as a
basis for the autonomous extraction of experimental physical
laws in data acquisition systems.

The fundamental steps of the proposed approach to 
extraction of experimental physical laws from given data 
are explained in the second section. We first define the 
estimators of the joint, the marginal and the conditional 
PDFs and derive from them the conditional average as 
an optimal estimator of a physical law that is hidden in 
joint data. In order to estimate the number of data ap- 
propriate for the extraction of a physical law, we further 
introduce the statistics that characterize the information 
provided by joint measurements. In the third section of 
the article the properties of the CA estimator and the 
other introduced statistics are demonstrated on cases of 
deterministically and randomly related data. 

2 Statistics of joint measurements 

2.1 Uncertainty of experimental observation 

Without loss of generality we consider a phenomenon that
can be quantitatively characterized by two scalar valued
variables $x$ and $y$ comprising a vector $z = (x, y)$. We further
assume that the phenomenon can be experimentally
explored by repetition of joint measurements on a two-channel
instrument having equal spans $S_x = (-L, L)$,
$S_y = (-L, L)$. Their Cartesian product $S_{xy} = S_x \otimes S_y$
determines the joint span. We treat a measurement of a
joint datum as a process in which the measured object
generates the instrument output $z = (x, y)$. The basic
properties of the instrument and measurement procedure
can be characterized by a calibration based on a set of
objects $\{w_{kl} = (u_k, v_l);\ k = 1, \ldots;\ l = 1, \ldots\}$ that
represent joint physical units. Using these units, a scale net can
be determined in the joint span $S_{xy}$ of the instrument. In
order to simplify the notation, we further omit the indices
of units.

A common property of measurements is that the output
of the instrument fluctuates even when calibration
is repeated. [1,2] We describe this property by the joint
PDF $\psi(z|w)$, which characterizes the scattering of the
instrument output at a given joint unit $w$. For the sake
of simplicity, we consider an instrument whose channels
can be calibrated mutually independently. In this case the
instrument scattering function is expressed by the product
of scattering functions corresponding to both channels:
$\psi(z|w) = \psi(x|u)\,\psi(y|v)$. Their mean values $u$, $v$, and
standard deviations $\sigma_x$, $\sigma_y$ represent an element of the
instrument scale and the scattering of instrument output at the
joint calibration. These values can be estimated statistically
by the sample mean and variance of both components
measured during repeated calibration by a joint unit $w$.
The standard deviation $\sigma$ characterizes the uncertainty
of the measurement procedure performed on a unit. [1,2]

We further consider the most frequent case in which the
output scattering does not depend on the channel index
and the position $w = (u, v)$ on the joint scale. In this
case it can be expressed as a function of the difference
$z - w = (x - u, y - v)$ and a common standard deviation
$\sigma = \sigma_x = \sigma_y$ as $\psi(z|w) = \psi(z - w, \sigma)$. We consider
scattering of instrument output during calibration as a
consequence of random disturbances in the measurement
system. When these disturbances are caused by contributions
from mutually independent sources, the central limit
theorem of the probability theory leads us to the Gaussian
scattering function $\psi(z - w, \sigma) = g(x - u, \sigma)\, g(y - v, \sigma)$, in
which the scattering of a single component is determined
by:

$$\psi(x|u) = g(x - u, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - u)^2}{2\sigma^2}\right) \tag{1}$$

2.2 Estimation of probability density functions

Let us consider a single measurement which yields a joint
datum $z_1 = (x_1, y_1)$. We assume that this joint datum
appears at the outputs of instrument channels, since it is
the most probable at a given state $z$ of the observed
phenomenon and the instrument during measurement. Therefore,
we utilize the measured datum $z_1$ as the center of the
probability distribution $\psi(z - z_1, \sigma) = \psi(x - x_1, \sigma)\,\psi(y - y_1, \sigma)$
that represents the corresponding state.

Consider next a series of $N$ repeated measurements
which yield the basic data set $\{z_i;\ i = 1, \ldots, N\}$. In
accordance with the above-given interpretation of measured
data we adapt to them the distributions
$\{\psi(z - z_i, \sigma);\ i = 1, \ldots, N\}$. If the data $z_1, \ldots, z_N$ are spaced more than $\sigma$
apart, we assume that their scattering is caused by variation
of the state $z$ in repeated measurements and generally
consider $z$ as a random vector variable. Its joint PDF is
determined by the statistical average over distributions
$\{\psi(z - z_i, \sigma);\ i = 1, \ldots, N\}$ as:

$$f_N(z) = \frac{1}{N} \sum_{i=1}^{N} \psi(z - z_i, \sigma) \tag{2}$$

This function represents an experimental model of PDF
and resembles Parzen's kernel estimator, which is often
used in statistical modeling of PDFs. [5,4] However, in Parzen's
modeling the kernel width $\sigma$ plays the role of a smoothing
parameter whose value decreases with the number of
data $N$, which is not consistent with the general properties
of measurements. In opposition to this, we consider $\sigma$
as an instrumental parameter that is determined by the
inaccuracy of measurement. [5,1] In the majority of experimental
observations $\sigma$ is a constant during measurements,
and hence need not be further indicated in the scattering
function $\psi$.

From the joint PDF $f(z) = f(x, y)$ the marginal PDF
$f(x)$ of a component $x$ is obtained by integration over the
other component, for example:

$$f(x) = \int f(x, y)\, dy \tag{3}$$

The conditional PDF of the variable $y$ at a given condition
$x$ is then defined by the ratio of the joint PDF and the
marginal PDF of the condition:

$$f(y|x) = \frac{f(x, y)}{f(x)} \tag{4}$$

Using the experimental model of joint PDF (2) we obtain
for the marginal and conditional PDFs the following kernel
estimators:

$$f_N(x) = \frac{1}{N} \sum_{i=1}^{N} \psi(x - x_i, \sigma) \tag{5}$$

$$f_N(y|x) = \frac{\sum_{i=1}^{N} \psi(x - x_i, \sigma)\, \psi(y - y_i, \sigma)}{\sum_{j=1}^{N} \psi(x - x_j, \sigma)} \tag{6}$$

2.3 Estimation of a physical law

It is often observed that the joint PDF resembles a crest
along some line $\hat{y} = \hat{y}(x)$. We consider $\hat{y}(x)$ as an estimator
of a hidden physical law $y = y_o(x)$ that provides for a
prediction of a value $y$ from the given value $x$. If we repeat
joint measurements, and consider only those that yield
the value $x$, we can generally observe that corresponding
values of the variable $y$ are scattered, at least due to the
stochastic character of the measurements. As an optimal
predictor of the variable $y$ at the given value $x$, we consider
the value $\hat{y}$ that yields the minimum of the mean square
prediction error $D$ at a given $x$:

$$D = \mathrm{E}[(\hat{y} - y)^2 \,|\, x] = \min(\hat{y}) \tag{7}$$

The minimum takes place when $dD/d\hat{y} = 0$. The solution
of this equation yields as the optimal predictor $\hat{y}$ the
conditional average

$$\hat{y}(x) = \mathrm{E}[y|x] = \int y\, f(y|x)\, dy \tag{8}$$

By using Eq. (6) for the conditional probability, we obtain
for CA the superposition

$$\hat{y}_N(x) = \sum_{i=1}^{N} y_i\, C_i(x) \tag{9}$$

The coefficients

$$C_i(x) = \frac{\psi(x - x_i, \sigma)}{\sum_{j=1}^{N} \psi(x - x_j, \sigma)} \tag{10}$$

represent a normalized measure of similarity between the
given value $x$ and sample values $x_i$ and satisfy the conditions:

$$\sum_{i=1}^{N} C_i(x) = 1 \tag{11}$$

$$0 \le C_i(x) \le 1 \tag{12}$$

The more similar the given value $x$ is to a datum $x_i$, the larger
is the coefficient $C_i(x)$, and with it the contribution of the
corresponding term $y_i C_i(x)$ to the sum in Eq. (9). The
prediction of the value $\hat{y}_N(x)$, which best corresponds to the
given value $x$, thus resembles the associative recall of memorized
items in the brains of intelligent beings, and therefore
could be treated as a basis for the development of
computerized autonomous modelers of physical laws and
related machine intelligence. [1]
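
The conditional average of Eqs. (9) and (10) can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names `gauss`, `ca_coefficients` and `ca_predict`, the use of NumPy, and the three toy samples are assumptions introduced here. Only the Gaussian scattering function of Eq. (1) and the formulas for $C_i(x)$ and $\hat{y}_N(x)$ come from the text.

```python
import numpy as np

def gauss(d, sigma):
    """Gaussian scattering function g(d, sigma) of Eq. (1)."""
    return np.exp(-0.5 * (d / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def ca_coefficients(x, xs, sigma):
    """Coefficients C_i(x) of Eq. (10).

    Rows index the query values x, columns index the samples x_i.
    Note: a query many sigma away from every sample makes the
    denominator underflow to zero; this sketch does not guard against it.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    xs = np.asarray(xs, dtype=float)
    k = gauss(x[:, None] - xs[None, :], sigma)
    return k / k.sum(axis=1, keepdims=True)

def ca_predict(x, xs, ys, sigma):
    """Conditional average predictor of Eq. (9): sum_i y_i C_i(x)."""
    return ca_coefficients(x, xs, sigma) @ np.asarray(ys, dtype=float)

# Toy samples (hypothetical values, chosen only for illustration).
xs = np.array([-1.0, 0.0, 1.0])
ys = np.array([2.0, 0.0, 2.0])
coeffs = ca_coefficients([0.0, 0.5], xs, sigma=0.2)
y_hat = ca_predict([0.0, 0.5], xs, ys, sigma=0.2)
```

At the query $x = 0.5$, which lies halfway between two samples, the two nearest coefficients are practically equal, so the prediction is close to the mean of the two neighbouring $y_i$. This is the smoothing behaviour discussed in the text.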

The predictor Eq. (9) is completely determined by the
set of measured data $\{z_i;\ i = 1, \ldots, N\}$ and the
instrument scattering function $\psi$. The predictor is not based
on any a priori assumption about the functional relation
between the variables $x$ and $y$, as is done for example
when a physical law is described by some regression function
in which parameters are adapted to given data. The
conditional average Eq. (9) can thus be treated as a
nonparametric regression, although the scattering functions
$\psi(z - z_i, \sigma)$ still depend on the parameters $z_i$, $\sigma$. However,
these parameters, as well as the form of the function $\psi$,
are totally specified by measurements. They represent a
property of the observed phenomenon and not an assumed
auxiliary of the modeling. Since the form of the CA
predictor does not depend on a specific phenomenon under
consideration, it could be considered as a generally
applicable basis for statistical modeling of physical laws in
terms of experimental data in an autonomous computer.
It is convenient that Eq. (9) can be simply generalized to a
multi-dimensional case by substituting the condition and
the estimated variable by the corresponding vectors. [1]
Moreover, it is convenient that the ordering into dependent
and independent variables is done automatically by
a specification of the condition.

2.3.1 Description of predictor quality 

We can interpret a phenomenon which is characterized by
the vector $z = (x, y)$ as a process that maps the variable
$x$ to the variable $y$. When the variables $x$ and $y$ are
stochastic, we most generally describe this mapping by the
joint PDF $f(x, y)$. Similarly, we can interpret the prediction
of the variable $\hat{y}(x)$ from the given value $x$ as a process
that runs in parallel with the observed phenomenon.
This process is also generally characterized by the PDF
$f(x, \hat{y})$, while the relation between the variables $y$ and $\hat{y}$
is characterized by the PDF $f(y, \hat{y})$. The better the
predictor is, the more the distribution $f(y, \hat{y})$ is concentrated
along the line $y = \hat{y}(x)$. For a good predictor we generally
expect that the prediction error $E_r = \hat{y} - y$ is close to
0. Since both variables are considered as stochastic ones,
we expect that the first and second moments of the
prediction error $\mathrm{E}[\hat{y} - y]$, $\mathrm{E}[(\hat{y} - y)^2]$ are small, while for
an exact prediction $\mathrm{E}[\hat{y} - y] = 0$ and $\mathrm{E}[(\hat{y} - y)^2] = 0$.
The second moment of the error is equal to $\mathrm{E}[(\hat{y} - y)^2] =
\mathrm{Var}(y) + \mathrm{Var}(\hat{y}) - 2\,\mathrm{Cov}(y, \hat{y}) + (m_y - m_{\hat{y}})^2$, where $m_y = \mathrm{E}[y]$
and $m_{\hat{y}} = \mathrm{E}[\hat{y}]$ denote mean values. If the variables $y$ and $\hat{y}$
are statistically independent and have equal mean values,
the covariance vanishes: $\mathrm{Cov}(y, \hat{y}) = 0$, and $m_y - m_{\hat{y}} = 0$,
so that $\mathrm{E}[(\hat{y} - y)^2] = \mathrm{Var}(y) + \mathrm{Var}(\hat{y})$. Based upon this
property we introduce a relative statistic called the
predictor quality with the formula

$$Q = 1 - \frac{\mathrm{E}[(\hat{y} - y)^2]}{\mathrm{Var}(y) + \mathrm{Var}(\hat{y})} = \frac{2\,\mathrm{Cov}(y, \hat{y})}{\mathrm{Var}(y) + \mathrm{Var}(\hat{y})} - \frac{(m_y - m_{\hat{y}})^2}{\mathrm{Var}(y) + \mathrm{Var}(\hat{y})} \tag{13}$$

Its value equals 1 for an exact prediction: $\hat{y} = y$, while it
equals 0, if the variables $y$, $\hat{y}$ are statistically independent
and have equal mean values. If the mean values differ:
$m_y - m_{\hat{y}} \ne 0$, the quality $Q$ can also be negative.

When the predictor is determined by the conditional
average (8), we obtain for its mean value

$$\mathrm{E}[\hat{y}] = \int \hat{y}(x)\, f(x)\, dx = \iint y\, f(y|x)\, f(x)\, dx\, dy = \iint y\, f(y, x)\, dx\, dy = \mathrm{E}[y] = m_y \tag{14}$$

Since in this case $m_y - m_{\hat{y}} = 0$, we further get

$$Q = \frac{2\,\mathrm{Cov}(y, \hat{y})}{\mathrm{Var}(y) + \mathrm{Var}(\hat{y})} \tag{15}$$

Similarly we get for the covariance

$$\begin{aligned}
\mathrm{Cov}(y, \hat{y}) &= \iint (y - m_y)(\hat{y}(x) - m_{\hat{y}})\, f(y, x)\, dx\, dy \\
&= \int (\hat{y}(x) - m_{\hat{y}}) \left[ \int (y - m_y)\, f(y|x)\, dy \right] f(x)\, dx \\
&= \int (\hat{y}(x) - m_{\hat{y}})^2 f(x)\, dx = \mathrm{Var}(\hat{y}),
\end{aligned} \tag{16}$$

so that the expected quality of the CA predictor is

$$Q = \frac{2\,\mathrm{Var}(\hat{y})}{\mathrm{Var}(y) + \mathrm{Var}(\hat{y})} \tag{17}$$

In the case when the relation between both components of
the vector $z$ is determined by some physical law $y_o(x)$, and
only the measurement procedure introduces an additive
noise $\nu$ with zero mean $\mathrm{E}[\nu] = 0$ and variance $\mathrm{E}[\nu^2] = \sigma^2$,
we can express the variable $y$ as $y = y_o(x) + \nu$. In this
case the following equations: $\mathrm{E}[(\hat{y} - y)^2] = \sigma^2$, $\mathrm{Var}(y) =
\mathrm{Var}(\hat{y}) + \sigma^2$ hold, and we get for the expected predictor
quality the expression:

$$Q = \frac{2\,\mathrm{Var}(\hat{y})}{2\,\mathrm{Var}(\hat{y}) + \sigma^2} \tag{18}$$

For $\mathrm{Var}(\hat{y}) \gg \sigma^2/2$ we have $Q \approx 1$, while for $\mathrm{Var}(\hat{y}) \ll
\sigma^2/2$ we have $Q \approx 0$. In the last case $\hat{y} \approx$ constant, while
$y$ fluctuates around this constant, and consequently the
prediction quality is low.

Since generally $\mathrm{Var}(y) \ge \mathrm{Var}(\hat{y})$ and $\mathrm{Var}(\hat{y}) \ge 0$, we
obtain from Eq. (17) the inequality $0 \le Q \le 1$. It describes
a mean property, which need not be fulfilled exactly if the
conditional average is statistically estimated from a finite
number of samples $N$; but we can expect that it holds
ever more with an increasing $N$. Likewise, we can generally
expect that with an increasing $N$, the statistically
estimated CA ever better represents the underlying physical
law $y = y_o(x)$. However, with an increasing $N$, the cost
of experiments increases, and consequently there generally
appears the question: "How to specify a number of samples
$N$ that is reasonable for the experimental estimation
of a hidden law $y_o(x)$?"

2.4 Experimental information

In order to answer the last question, we proceed with the
description of the indeterminacy of the vector variable $z$
in terms of the entropy of information. Following the
definitions given for a scalar random variable in the previous
article, [3] we first describe the indeterminacy of the
component $x$. For this purpose we introduce a uniform reference
PDF $\rho(x) = 1/(2L)$ that hypothetically corresponds
to the most indeterminate noninformative observation of
variable $x$; or to equivalently prepared initial states of the
instrument before executing the experiments in a series
of observations. By using this reference and the marginal
PDF $f(x)$, we first define the indeterminacy of a continuous
random variable by the negative value of the relative
entropy [6,7]

$$H_x = -\int_{S_x} f(x) \log\left(\frac{f(x)}{\rho(x)}\right) dx \tag{19}$$

Using the expressions for the reference, instrumental scattering
function, and experimentally estimated PDF, we
obtain the expressions for the uncertainty $H_u$ of calibration
performed on a unit $u$, the uncertainty $H_x$ of the
component $x$, experimental information $I_x$ provided by
$N$ measurements of $x$, and the redundancy $R_x$ of these
measurements as follows [3]:

$$\begin{aligned}
H_u &= -\int \psi(x, u) \log(\psi(x, u))\, dx - \log(2L), \\
H_x &= -\int f_N(x) \log(f_N(x))\, dx - \log(2L), \\
I_x(N) &= H_x - H_u, \\
R_x(N) &= \log(N) - I_x(N).
\end{aligned} \tag{20}$$

Similar equations are obtained for the component $y$ by
substituting $x \to y$.

In order to describe the uncertainty of the random vector
$z$, we utilize the reference PDF that is uniform inside
the joint span $S_{xy}$: $\rho(z) = \rho(x)\rho(y) = 1/(2L)^2$, and
vanishes elsewhere. By analogy with the scalar variable we
define the indeterminacy of the random vector $z$ by the
negative value of the relative entropy: [6]

$$H_{xy} = -\iint_{S_{xy}} f(z) \log\left(\frac{f(z)}{\rho(z)}\right) dx\, dy \tag{21}$$

In the case of a uniform reference PDF we obtain

$$H_{xy} = -\iint_{S_{xy}} f(z) \log(f(z))\, dx\, dy - 2\log(2L) \tag{22}$$

With this formula we then express the uncertainty of the
joint instrument calibration as

$$H_w = -\iint_{S_{xy}} \psi(z, w) \log(\psi(z, w))\, dx\, dy - 2\log(2L) \tag{23}$$

For $\sigma \ll L$ we obtain from the Gaussian scattering function
$\psi(z, z_i) = g(x - x_i, \sigma)\, g(y - y_i, \sigma)$ the approximation

$$H_w \approx 2\log\left(\frac{\sigma}{L}\right) + \log\left(\frac{\pi}{2}\right) + 1 \tag{24}$$

The uncertainty of calibration depends on the ratio between
the scattering width $2\sigma$ and the instrument span $2L$
in both directions. The number $2\log(\sigma/L)$ determines the
lowest possible uncertainty of measurement on the given
two-channel instrument, as achieved at its joint calibration.

The indeterminacy of the random vector $z$, which
characterizes the scattering of experimental data, is defined by
the estimated joint PDF as

$$H_{xy} = -\iint_{S_{xy}} f_N(z) \log(f_N(z))\, dx\, dy - 2\log(2L) \tag{25}$$

and is generally greater than the uncertainty of calibration
described by $H_w$. Since $H_w$ denotes the lowest possible
indeterminacy of observation carried out over a given
instrument, we define the joint experimental information
$I_{xy}$ about vector $z = (x, y)$ by the difference

$$I_{xy}(N) = H_{xy} - H_w = -\iint_{S_{xy}} f_N(z) \log(f_N(z))\, dx\, dy + \iint_{S_{xy}} \psi(z, w) \log(\psi(z, w))\, dx\, dy \tag{26}$$

Most properties of the uncertainty and information appertaining
to a random vector are similar to those in the case
of a scalar variable. For example, the reference density $\rho(z)$
can be arbitrarily selected since it is excluded from the
specification of the experimental information. Furthermore,
the joint experimental information $I_{xy}(1)$ provided
by a single measurement is zero. For a measurement which
yields multiple samples $z_1, \ldots, z_N$ that are mutually
separated by several $\sigma$ in both directions, the distributions
$\psi(z, z_i) = g(x - x_i, \sigma)\, g(y - y_i, \sigma)$ are nonoverlapping and
the first integral on the right of Eq. (26) can be approximated as

$$\begin{aligned}
-\iint f_N(z) \log(f_N(z))\, dx\, dy &\approx -\sum_{i=1}^{N} \iint \frac{1}{N}\, \psi(z, z_i) \log\left[\frac{1}{N}\, \psi(z, z_i)\right] dx\, dy \\
&= \log(N) - \iint \psi(z, z_1) \log \psi(z, z_1)\, dx\, dy
\end{aligned} \tag{27}$$

so that we get $I_{xy}(N) \approx \log(N)$. If the distributions $\psi(z, z_i)$
are overlapping but not concentrated at a single point, the
inequality $0 < I_{xy}(N) < \log(N)$ holds generally. Similarly
as the entropy of information for a discrete random variable,
the experimental information describes how much
information is provided by $N$ experiments performed by
an instrument that is not infinitely accurate. [6] In accordance
with these properties the experimental information
describes the complexity of experimental data in units of
information entropy which are here nats.

When the distributions $\psi(z, z_i)$ are nonoverlapping, $N$
repeated experiments yield the maximal possible information
$\log(N)$. However, with an increasing number $N$, ever
more overlapping of distributions $\psi(z, z_i)$ takes place, and
therefore the experimental information $I_{xy}(N)$ increases
more slowly than $\log(N)$. Consequently the repetition of
joint measurements becomes on average ever more redundant
with an increasing number $N$. The difference

$$R_{xy}(N) = \log(N) - I_{xy}(N) \tag{28}$$

thus represents the redundancy of repeated joint measurements
in $N$ experiments. Since the overlapping of distributions
$\psi(z, z_i)$ increases with an increasing number of
experiments, the experimental information on average tends
to a constant value $I_{xy}(\infty)$, and along with this, the
redundancy increases with $N$.

The number

$$K_{xy}(N) = e^{I_{xy}(N)} \tag{29}$$

describes how many nonoverlapping distributions are needed
to represent the experimental observation. With an increasing
$N$, the number $K_{xy}(N)$ tends to a fixed value
$K_{xy}(\infty)$ that can be well estimated already from a finite
number of experiments. We could conjecture that $K_{xy}(\infty)$
approximately determines a reasonable number of experiments
that provide sufficient data for an acceptable modeling
of the joint PDF. However, it is still better to determine
such a number from a properly introduced cost
function of the experimental observation. With this aim
we consider the difference $D_{xy}(N) = I_{xy}(\infty) - I_{xy}(N)$ as
the measure of the discrepancy between the experimentally
observed and the true properties of the phenomenon.
An information cost function is then composed of the
redundancy and the discrepancy measure:

$$C_{xy}(N) = R_{xy}(N) + D_{xy}(N) \tag{30}$$

Since the redundancy on average increases, while the
discrepancy measure decreases with the number of measurements
$N$, we expect that the cost function $C_{xy}(N)$ exhibits
a minimum at a certain number $N_o$, which could be
considered as an optimal one for the experimental modeling
of a phenomenon. From the definition of redundancy
and the discrepancy measure we further obtain $C_{xy}(N) =
R_{xy}(N) + D_{xy}(N) = \log(N) - 2I_{xy}(N) + I_{xy}(\infty)$. Since the
last term is a constant for a given phenomenon, it is not
essential for the determination of $N_o$, and can be omitted
from the definition of the cost function. This yields a
simpler version

$$C_{xy}(N) = \log(N) - 2I_{xy}(N) \tag{31}$$

which is more convenient for application since it does not
include the limit value $I_{xy}(\infty)$. In a previous article [3]
we have proposed a cost function that is composed of
the redundancy and the information measure of the
discrepancy between the hypothetical and experimentally
observed PDFs. However, such a definition is less convenient
than the present one, although the values of $N_o$
determined from both cost functions do not differ essentially.
Numerical investigations also show that the optimal number
$N_o$ approximately corresponds to $K_{xy}(\infty) = e^{I_{xy}(\infty)}$
if the distribution of the data points is approximately uniform.
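
The statistics of Eqs. (26) and (28)-(31) can be estimated numerically. The following Python sketch rests on several assumptions that are not part of the source: the integrals are replaced by Riemann sums on a regular grid over the span $(-L, L)^2$ with $L = 10$ and step 0.05, the calibration entropy is taken from the closed form for a two-dimensional Gaussian (valid for $\sigma \ll L$), and the function names and random seed are hypothetical. The term $2\log(2L)$ cancels in Eq. (26) and is therefore omitted.

```python
import numpy as np

def gauss(d, sigma):
    """Gaussian scattering function g(d, sigma) of Eq. (1)."""
    return np.exp(-0.5 * (d / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def information_curve(xs, ys, sigma, half_span=10.0, step=0.05):
    """Joint experimental information I_xy(N), Eq. (26), for N = 1..len(xs)."""
    g = np.arange(-half_span, half_span + step / 2, step)
    # -integral of psi*log(psi) for a 2D Gaussian (assumes sigma << half_span)
    h_psi = np.log(2.0 * np.pi * np.e * sigma ** 2)
    acc = np.zeros((g.size, g.size))       # running sum of the N kernels
    info = np.empty(len(xs))
    for n, (xi, yi) in enumerate(zip(xs, ys), start=1):
        acc += np.outer(gauss(g - xi, sigma), gauss(g - yi, sigma))
        f = acc / n                        # f_N(z) of Eq. (2) on the grid
        f = f[f > 0]                       # skip cells where f log f -> 0
        info[n - 1] = -np.sum(f * np.log(f)) * step ** 2 - h_psi
    return info

def cost_statistics(info):
    """Redundancy (28), number K (29), cost (31) and the minimizing N."""
    n = np.arange(1, info.size + 1)
    redundancy = np.log(n) - info
    k = np.exp(info)
    cost = np.log(n) - 2.0 * info
    return redundancy, k, cost, int(n[np.argmin(cost)])

# Illustrative data (seed and sample size are arbitrary assumptions).
rng = np.random.default_rng(0)
x = rng.uniform(-8.0, 8.0, 40) + rng.normal(0.0, 0.2, 40)
y = x * (x - 5.0) * (x + 10.0) / 100.0 + rng.normal(0.0, 0.2, 40)
info = information_curve(x, y, sigma=0.2)
redundancy, k, cost, n_opt = cost_statistics(info)

# Two samples far apart: nonoverlapping kernels, so I_xy(2) ~ log(2).
two_far = information_curve(np.array([-5.0, 5.0]), np.array([0.0, 0.0]), 0.2)
```

The last two lines check the limiting case described in the text: for nonoverlapping distributions the information reaches its maximal value $\log(N)$, while a single sample yields zero information.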




Although the experimental information of a vector variable
and its scalar components exhibits similar properties,
their values generally do not coincide since the overlapping
of distributions $\psi(z, z_i)$ generally differs from that of
distributions $\psi(x, x_i)$ or $\psi(y, y_i)$. Therefore, the experimental
information provided by joint measurements generally
differs from that provided by measurements of single
components.

2.5 Mutual information and determination of one 
variable by the other 

In order to describe the information corresponding to the
relation between variables $x$, $y$ we introduce conditional
entropy. At a given value $x$ we express the entropy pertaining
to the variable $y$ by the conditional PDF as

$$H_{y|x}(x) = -\int_{S_y} f(y|x) \log\left(\frac{f(y|x)}{\rho(y)}\right) dy \tag{32}$$

If we express in Eq. (21) the joint PDF by the conditional
one $f(z) = f(y|x) f(x)$ we obtain the following equation:

$$H_{xy} = H_{y|x} + H_x \tag{33}$$

in which $H_{y|x}$ denotes the average conditional entropy of
information

$$H_{y|x} = \int_{S_x} H_{y|x}(x)\, f(x)\, dx \tag{34}$$

When we exchange the meaning of the variables we get

$$H_{xy} = H_{x|y} + H_y \tag{35}$$

Based on these equations and Eq. (26) we obtain the following
relation between the joint and the conditional information

$$I_{xy} = H_{x|y} + H_y - H_u - H_v = I_{y|x} + I_x = I_{x|y} + I_y \tag{36}$$

where the conditional information is defined by

$$I_{x|y} = H_{x|y} - H_u \quad \text{or} \quad I_{y|x} = H_{y|x} - H_v \tag{37}$$

When the components of the vector $z$ are statistically
independent, the joint PDF is equal to the product of
marginal probabilities and the joint information is given
by the sum $I_{xy} = I_x + I_y$, which represents the maximal
possible information that could be provided by joint
measurements. However, when $x$ and $y$ are not statistically
independent, the joint information is less than the
maximal possible one: $I_{xy} < I_x + I_y$. The difference

$$I_m = I_x + I_y - I_{xy} = I_x - I_{x|y} = I_y - I_{y|x} \tag{38}$$

can be interpreted as the experimental information that
a measurement of one variable provides about another one
and is consequently called the mutual information. [6,8,9,10]
In accordance with the previous interpretation of the
redundancy, it follows from the last two terms in Eq. (38)
that the mutual information also describes how redundant
on average is a measurement of the variable $y$ at a
given $x$ or vice versa. In accordance with the definition of
the redundancy of a certain number $N$ of measurements
$R_x(N) = \log(N) - I_x$, we further define also the mutual
redundancy of $N$ joint measurements

$$R_m(N) = \log(N) - I_m(N) \tag{39}$$

If we then take into account all the definitions of the
redundancies and types of information, we obtain the formula:

$$R_{xy}(N) = R_x(N) + R_y(N) - R_m(N) \tag{40}$$

It should be pointed out that redundancies $R_{xy}(N)$, $R_x(N)$,
$R_y(N)$, and $R_m(N)$ generally increase with $N$, while the
corresponding experimental information tends to fixed values
that correspond to the amount of data needed for
presenting related variables.

In order to describe quantitatively how well the value
of the variable $y$ is determined by the value of $x$ on average,
we propose a relative measure of determination by
the ratio

$$D_{y|x} = \frac{I_m}{I_y} = 1 - \frac{I_{y|x}}{I_y} \tag{41}$$

If $D_{y|x} > D_{x|y}$, the value of the variable $x$ better determines
the value of $y$ than vice versa. In this case the variable
$x$ could be considered as more fundamental for the
description of the phenomenon, and consequently as an
independent one. In the case of functional dependence
described by a physical law $y = y_o(x)$, the relative measure
of determination is $D_{y|x} = 1$, while for the statistically
independent variables $x$ and $y$ it is $D_{y|x} = 0$.
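
The two limiting cases quoted above can be checked numerically with a short Python sketch. It is only an illustration under stated assumptions: the integrals are approximated by Riemann sums on a grid over $(-L, L)$ with $L = 10$, the calibration entropies are taken from the Gaussian closed form, the expression $D_{x|y} = I_m / I_x$ is obtained from Eq. (41) by exchanging the variables, and the function names and sample configurations are hypothetical.

```python
import numpy as np

def gauss(d, sigma):
    """Gaussian scattering function g(d, sigma) of Eq. (1)."""
    return np.exp(-0.5 * (d / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def entropy(f, cell):
    """Riemann-sum estimate of -integral f log f; cell is the grid cell size."""
    f = f[f > 0]
    return -np.sum(f * np.log(f)) * cell

def information_statistics(xs, ys, sigma, half_span=10.0, step=0.05):
    """Marginal, joint and mutual information with the ratios of Eq. (41)."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    g = np.arange(-half_span, half_span + step / 2, step)
    kx = gauss(g[:, None] - xs[None, :], sigma)   # shape (grid, N)
    ky = gauss(g[:, None] - ys[None, :], sigma)
    h1 = 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)  # 1D Gaussian entropy
    i_x = entropy(kx.mean(axis=1), step) - h1            # Eq. (20)
    i_y = entropy(ky.mean(axis=1), step) - h1
    i_xy = entropy(kx @ ky.T / xs.size, step ** 2) - 2.0 * h1  # Eq. (26)
    i_m = i_x + i_y - i_xy                                # Eq. (38)
    return {"I_x": i_x, "I_y": i_y, "I_xy": i_xy, "I_m": i_m,
            "D_y|x": i_m / i_y, "D_x|y": i_m / i_x}

# Functionally related, well separated samples (y = x): D_y|x close to 1.
line = information_statistics([-6, -2, 2, 6], [-6, -2, 2, 6], sigma=0.2)

# All combinations of x and y values (independent variables): D_y|x close to 0.
grid = information_statistics([-5, -5, 5, 5], [-5, 5, -5, 5], sigma=0.2)
```

In the first configuration every marginal and joint kernel is isolated, so $I_x = I_y = I_{xy} = \log 4$ and the mutual information equals the marginal one. In the second configuration $I_{xy} = I_x + I_y$, so the mutual information vanishes.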

The entropy of information is generally decreased if
the distribution of scattered experimental data at a given
$x$ is compressed to the estimated physical law $\hat{y}(x)$. The
corresponding information gain is in drastic contrast to
the information loss that is caused by the noise in a
measurement system. [11]

3 Illustration of statistics

3.1 Data with a hidden law

The purpose of this section is to demonstrate graphically
the basic properties of the statistics introduced above. For
this purpose it is most convenient to generate data numerically
since in this case the relation between the variables
$x$ and $y$, as well as the properties of the scattering
function $\psi(z)$, can be simply set. For our demonstration
we arbitrarily selected a third order polynomial law
$y_o(x) = [x(x - 5)(x + 10)]/100$ and the Gaussian scattering
function with standard deviation $\sigma = 0.2$. To simulate
the basic data set $\{z_i;\ i = 1, \ldots, N\}$, we first calculated
50 sample values $x_i$ by summing two random terms
obtained from a generator with a uniform distribution in
the interval $[-8, +8]$ and from a Gaussian generator having
the mean value 0 and standard deviation $\sigma = 0.2$.
The corresponding sample values $y_i$ were then calculated
as a sum of terms obtained from the selected law $y_o(x_i)$
and the same random Gaussian generator with a different
seed. The generated data $\{x_i, y_i;\ i = 1, \ldots, 50\}$ were used
as centers of the scattering function when estimating the joint
PDF based on Eq. (2). An example of such a PDF is shown
in Fig. 1, while the corresponding joint data of the basic
set are shown by points in the top curve of Fig. 2 together
with the underlying law $y_o(x)$.
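
The data generation described above, together with the joint PDF estimate of Eq. (2), can be sketched in Python as follows. The polynomial law, the interval $[-8, +8]$, $\sigma = 0.2$ and $N = 50$ are taken from the text. The NumPy random generator, its seed, the evaluation grid over $(-10, 10)$ and the function names are assumptions made for this illustration, so the generated samples will not coincide with those used for the figures.

```python
import numpy as np

SIGMA = 0.2  # standard deviation of the scattering function (from the text)

def y_o(x):
    """Third order polynomial law selected for the demonstration."""
    return x * (x - 5.0) * (x + 10.0) / 100.0

def make_data(n, rng, sigma=SIGMA):
    """x: uniform in [-8, 8] plus zero-mean Gaussian term; y: y_o(x) plus Gaussian term."""
    x = rng.uniform(-8.0, 8.0, n) + rng.normal(0.0, sigma, n)
    y = y_o(x) + rng.normal(0.0, sigma, n)
    return x, y

def gauss(d, sigma):
    """Gaussian scattering function g(d, sigma) of Eq. (1)."""
    return np.exp(-0.5 * (d / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def joint_pdf(gx, gy, xs, ys, sigma=SIGMA):
    """Joint PDF estimate f_N(x, y) of Eq. (2) on the grid gx x gy."""
    kx = gauss(gx[:, None] - xs[None, :], sigma)   # shape (len(gx), N)
    ky = gauss(gy[:, None] - ys[None, :], sigma)   # shape (len(gy), N)
    return kx @ ky.T / xs.size

rng = np.random.default_rng(0)        # arbitrary seed, not from the source
xs, ys = make_data(50, rng)
gx = np.linspace(-10.0, 10.0, 401)    # assumed span (-L, L) with L = 10
gy = np.linspace(-10.0, 10.0, 401)
f = joint_pdf(gx, gy, xs, ys)

# The estimate should integrate to approximately one over the joint span.
total = f.sum() * (gx[1] - gx[0]) * (gy[1] - gy[0])
```

Because the kernel is a product of the two channel functions, the sum over samples reduces to a single matrix product, which keeps the evaluation on a fine grid inexpensive.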

The conditional average predictor, which corresponds
to the presented example, was modeled by inserting data
from the basic data set into Eq. (9). To demonstrate its
performance, we additionally generated a test data set by


[Figure 1: plot of the joint PDF over the (X, Y) plane; panel title: N = 50, σ = 0.2]

Fig. 1. The joint PDF f(x,y) utilized to demonstrate the 
properties of the conditional average predictor. 

[Figure 2: panel title "Testing of CA predictor"; horizontal axis from -10 to 10]

Fig. 2. Testing of CA predictor. Curves representing the
underlying law and given data $y_o, y$ (top), test and predicted
data $y_t, y_p$ (middle), and prediction error $E_r = y_p - y_t$
(bottom) are displaced in vertical direction for a better
visualization.

the same procedure as in the case of the basic data set, but
with different seeds of all the random generators. Using
the values x_{i,t} of the test set, we then predicted the
corresponding values y_{i,p} by the modeled CA predictor. With
this procedure we simulated a situation that is normally
met when a natural law is modeled and tested based upon
experimental data. The test and predicted data are shown
by the middle two curves in Fig. 2. From both data sets
the prediction error E_r = y_p - y_t was calculated; it is
presented by the bottom curve (..*..) in Fig. 2. The curve
representing the predicted data (-o-) is smoother than the
curve representing the original test data (..•..). This
property is a consequence of the smoothing caused by
estimating the conditional mean value from various data
included in the modeled CA predictor. In spite of this
smoothing, it is obvious that the characteristic properties
of the relation between the variables x and y are
approximately extracted from the given data by the CA
predictor. This further means that the properties of the
hidden law y = y(x) can be approximately described in the
region where measured data appear, based on a finite number
of joint samples.
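
The modeling and testing of the CA predictor described above can be sketched as follows. The sketch assumes a joint PDF estimator built from a product of Gaussian scattering functions; for that case the conditional average reduces to a weighted mean of the given y_i. The law in `law` is a hypothetical stand-in, and the names `make_data` and `ca_predict` are illustrative rather than taken from the paper.

```python
import numpy as np

SIGMA = 0.2  # scattering width used in the text


def law(x):
    # Hypothetical stand-in for the unspecified law y(x).
    return 3.0 * np.sin(0.5 * x)


def make_data(n, seed, sigma=SIGMA):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-8.0, 8.0, n) + rng.normal(0.0, sigma, n)
    y = law(x) + rng.normal(0.0, sigma, n)
    return x, y


def ca_predict(x, xs, ys, sigma=SIGMA):
    # Conditional average E[y|x] for a product-Gaussian joint estimator:
    # y_p(x) = sum_i y_i psi(x - x_i) / sum_j psi(x - x_j)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    logw = -0.5 * ((x[:, None] - xs[None, :]) / sigma) ** 2
    logw -= logw.max(axis=1, keepdims=True)  # avoid underflow far from data
    w = np.exp(logw)
    return (w @ ys) / w.sum(axis=1)


xs, ys = make_data(50, seed=1)   # basic data set
xt, yt = make_data(50, seed=2)   # test set generated with a different seed
yp = ca_predict(xt, xs, ys)      # predicted values
err = yp - yt                    # prediction error E_r = y_p - y_t
print(f"RMS prediction error: {np.sqrt(np.mean(err ** 2)):.3f}")
```

Because the prediction is a convex combination of the given y_i, it is necessarily smoother than the test data and never leaves the range of the basic data set.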

[Figure 3 panel title: PREDICTOR QUALITY]

Fig. 3. Dependence of predictor quality Q on number of
samples N determined by various statistical data sets.

The quality of estimation of the hidden law y(x) depends
on the values and number N of statistical samples
utilized in Eq. (2) in the modeling of CA and its testing. To
demonstrate this property, we repeated the complete
procedure three times, using various statistical data sets
with increasing N, and determined the dependence of
predictor quality Q on N. The result is presented in Fig. 3.
The quality fluctuates statistically with increasing N, but
the fluctuations become ever less pronounced, so that the
quality determined from different data sets converges to a
common limit value at large N. In our example with σ = 0.2
the limit value is approximately Q = 0.98. With increasing
N, the curves corresponding to different data sets join
approximately at N_CA ≈ 30. At higher N the fluctuations
of Q are ever less pronounced. We could conjecture
that about 30 data values are needed to model the CA
predictor approximately in the presented case.
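
The repetition of the procedure for increasing N can be sketched as follows. The definition of Q is given earlier in the paper and is not reproduced in this section, so the sketch substitutes a generic normalized score, `quality_hyp` = 1 - (mean squared prediction error) / (variance of the test data). This is explicitly not the paper's definition. The Gaussian scattering function and the law in `law` are likewise assumptions.

```python
import numpy as np

SIGMA = 0.2  # scattering width used in the text


def law(x):
    # Hypothetical stand-in for the unspecified law y(x).
    return 3.0 * np.sin(0.5 * x)


def make_data(n, seed, sigma=SIGMA):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-8.0, 8.0, n) + rng.normal(0.0, sigma, n)
    y = law(x) + rng.normal(0.0, sigma, n)
    return x, y


def ca_predict(x, xs, ys, sigma=SIGMA):
    # Conditional average for an assumed Gaussian scattering function.
    logw = -0.5 * ((x[:, None] - xs[None, :]) / sigma) ** 2
    logw -= logw.max(axis=1, keepdims=True)
    w = np.exp(logw)
    return (w @ ys) / w.sum(axis=1)


def quality_hyp(yp, yt):
    # Hypothetical normalized score, NOT the paper's definition of Q.
    return 1.0 - np.mean((yp - yt) ** 2) / np.var(yt)


sizes = [5, 10, 20, 30, 50, 100, 200]
curves = {}
for seed in (1, 2, 3):                       # three statistical data sets
    xs, ys = make_data(max(sizes), seed)
    xt, yt = make_data(max(sizes), seed + 100)
    curves[seed] = [quality_hyp(ca_predict(xt[:n], xs[:n], ys[:n]), yt[:n])
                    for n in sizes]

for seed, q in curves.items():
    print(seed, np.round(q, 3))
```

With such a score one would expect the three curves to scatter at small N and to approach one another as N grows, which is the qualitative behavior described for Fig. 3.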

The smaller the scattering width σ is, the higher generally
the limit value of the predictor quality is, but on
average Q is still less than 1 if 1/σ and N are finite. This
property is in tune with the well-known fact that it is
impossible to determine exactly the law y = y(x) from
joint data that are measured by an instrument which is
subject to output scattering due to inherent stochastic
disturbances.

Fig. 4. Dependence of log(N), experimental information I_xy,
mutual information I_m, redundancy R_xy, and cost function
C_xy on the number of samples N determined by various
statistical data sets.

Fig. 5. Dependence of log(N), experimental information I_xy,
marginal informations I_x, I_y, and mutual information I_m on
the number of samples N.

The properties of the statistics that are formulated
based upon the entropy of information are demonstrated
for the case with σ = 0.2 in Fig. 4. It shows the dependence
of experimental information I_xy, mutual information I_m,
redundancy R_xy, and cost function C_xy on the
number of samples N for three different sample sets. In
the same figure the maximal possible information, which
corresponds to the ideal case with no scattering, is also
presented by the curve log(N), since it represents the basis
for defining the redundancy. As in the one-dimensional
case [3], the experimental information I_xy in
the two-dimensional case also converges with increasing
N to a fixed value. In the presented case the limit value
is I_xy(∞) ≈ 3.2, which yields the number exp(3.2) ≈ 25. This
number is approximately equal to the ratio of the standard
deviation of variable x and the scattering width σ and
describes how many uniformly distributed samples are
needed to represent the PDF of the data [3]. Due to the
convergence of experimental information to a fixed value,
the curve I_xy(N) starts to deviate from log(N) with
increasing N. Consequently the redundancy
R_xy = log(N) - I_xy(N) starts to increase, which further
leads to the minimum of the cost function
C_xy(N) = log(N) - 2 I_xy(N).
The minimum is not well pronounced due to statistical
variations, but it takes place at approximately N_o ≈ 30.
Not surprisingly, the optimal number N_o approximately
corresponds to the number ≈ 25 given above and also to N_CA.
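
The dependence of the experimental information, redundancy and cost function on N can be sketched numerically as follows. The paper's definition of the experimental information via relative entropy appears earlier and is not reproduced here. The sketch assumes the form I = H_f - H_ψ, the entropy of the estimated PDF minus the entropy of one scattering function, because this form gives I = log(N) for nonoverlapping scattering functions and I = 0 for fully overlapping ones. A Gaussian scattering function, a Monte Carlo estimate of the entropy, and the hypothetical law in `law` are further assumptions.

```python
import numpy as np

SIGMA = 0.2  # scattering width used in the text


def law(x):
    # Hypothetical stand-in for the unspecified law y(x).
    return 3.0 * np.sin(0.5 * x)


def make_data(n, seed, sigma=SIGMA):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-8.0, 8.0, n) + rng.normal(0.0, sigma, n)
    y = law(x) + rng.normal(0.0, sigma, n)
    return x, y


def entropy_mc(centers, sigma, rng, m=4000):
    # Monte Carlo estimate of H_f = -E_f[log f] for the Gaussian-kernel PDF f.
    n, d = centers.shape
    # Sample from f: pick a center at random, then add Gaussian scatter.
    s = centers[rng.integers(0, n, m)] + rng.normal(0.0, sigma, (m, d))
    a = -0.5 * ((s[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) / sigma ** 2
    amax = a.max(axis=1)
    log_f = (amax + np.log(np.exp(a - amax[:, None]).sum(axis=1))
             - np.log(n) - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2))
    return -log_f.mean()


def information(centers, sigma=SIGMA, seed=0):
    # Assumed form: I = H_f - H_psi, with H_psi the entropy of one
    # d-dimensional Gaussian scattering function.
    d = centers.shape[1]
    h_psi = 0.5 * d * (1.0 + np.log(2.0 * np.pi * sigma ** 2))
    return entropy_mc(centers, sigma, np.random.default_rng(seed)) - h_psi


xs, ys = make_data(100, seed=1)
z = np.column_stack([xs, ys])
sizes = np.arange(5, 101, 5)
info = np.array([information(z[:n]) for n in sizes])
redundancy = np.log(sizes) - info        # R_xy = log(N) - I_xy(N)
cost = np.log(sizes) - 2.0 * info        # C_xy = log(N) - 2 I_xy(N)
n_opt = sizes[np.argmin(cost)]
print("I_xy(100) =", round(float(info[-1]), 2), " N at cost minimum =", n_opt)
```

Since the cost curve is flat near its minimum, the position reported by `argmin` fluctuates between data sets, in line with the remark that the minimum is not well pronounced.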

[Figure 6 horizontal axis: N]

Fig. 6. Dependence of log(N), experimental information I_xy,
redundancy R_xy, and cost function C_xy on the number of
samples N determined from various data sets and scattering
widths σ.

Fig. 7. Dependence of relative measure of determination D_{y|x}
(top lines) and D_{x|y} (bottom lines) on the number of
samples N determined from various statistical data sets.

Like the joint experimental information I_xy, the
marginal experimental informations I_x, I_y also converge
to fixed values with increasing N [3]. These statistics are
presented in Fig. 5 for the same data generator as applied
in the case of Fig. 4. The sample values of variable x lie
in a larger interval than those of variable y. Hence
there is less overlapping of the scattering functions
comprising the marginal PDF of x, and consequently I_x is
larger than I_y. It is also characteristic that I_xy is larger
than I_x, since the data points in the joint span S_xy are
more separated than in the marginal span S_x. Since the
mutual information I_m is defined as I_m = I_x + I_y - I_xy,
its properties depend on both the marginal and the joint
information, and consequently I_m converges more quickly to
the limit value than the experimental information I_xy.
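
The relation I_m = I_x + I_y - I_xy can be illustrated numerically as follows. As a sketch under assumptions, each experimental information is taken as I = H_f - H_ψ with a Gaussian scattering function and a Monte Carlo entropy estimate. Neither the estimator form nor the law in `law` is taken from this section.

```python
import numpy as np

SIGMA = 0.2  # scattering width used in the text


def law(x):
    # Hypothetical stand-in for the unspecified law y(x).
    return 3.0 * np.sin(0.5 * x)


def entropy_mc(centers, sigma, rng, m=4000):
    # Monte Carlo estimate of H_f = -E_f[log f] for the Gaussian-kernel PDF f.
    n, d = centers.shape
    s = centers[rng.integers(0, n, m)] + rng.normal(0.0, sigma, (m, d))
    a = -0.5 * ((s[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) / sigma ** 2
    amax = a.max(axis=1)
    log_f = (amax + np.log(np.exp(a - amax[:, None]).sum(axis=1))
             - np.log(n) - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2))
    return -log_f.mean()


def information(centers, sigma=SIGMA, seed=0):
    # Assumed form: I = H_f - H_psi for a d-dimensional Gaussian psi.
    d = centers.shape[1]
    h_psi = 0.5 * d * (1.0 + np.log(2.0 * np.pi * sigma ** 2))
    return entropy_mc(centers, sigma, np.random.default_rng(seed)) - h_psi


rng = np.random.default_rng(1)
n = 100
x = rng.uniform(-8.0, 8.0, n) + rng.normal(0.0, SIGMA, n)
y = law(x) + rng.normal(0.0, SIGMA, n)

I_x = information(x[:, None])                 # marginal information of x
I_y = information(y[:, None])                 # marginal information of y
I_xy = information(np.column_stack([x, y]))   # joint information
I_m = I_x + I_y - I_xy                        # mutual information
print(f"I_x={I_x:.2f}  I_y={I_y:.2f}  I_xy={I_xy:.2f}  I_m={I_m:.2f}")
```

Under these assumptions I_m coincides with the ordinary mutual information of the estimated joint PDF, because the entropy of the two-dimensional Gaussian scattering function is the sum of the two one-dimensional ones.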

To demonstrate the influence of the scattering width on
the presented statistics, the calculations were repeated with
σ = 0.1 and 0.4. The results are presented in Fig. 6. For
the sake of clear presentation, the curves representing the
mutual information I_m are omitted. As could be expected,
the limit value of I_xy increases with decreasing σ. This
property is consistent with the well-known fact that more
information can be obtained by experimental observation
when using an instrument of higher accuracy, which
corresponds to a smaller scattering width. In contrast,
the redundancy of measurement decreases, and along with
it, the optimal number N_o increases with decreasing
scattering width.
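
The trend with σ can be checked against the rule of thumb stated for σ = 0.2, namely that the characteristic number of samples is approximately the ratio of the standard deviation of x to the scattering width. The short calculation below is only an arithmetic illustration. It uses the standard deviation of the uniform term on [-8, +8] alone and neglects the small Gaussian contribution, which is an assumption of this sketch.

```python
import math

L = 16.0                        # length of the interval [-8, 8]
sigma_x = L / math.sqrt(12.0)   # standard deviation of a uniform variable

ratios = {}
for sigma in (0.1, 0.2, 0.4):
    ratios[sigma] = sigma_x / sigma
    print(f"sigma = {sigma}: sigma_x/sigma = {ratios[sigma]:.1f}, "
          f"log = {math.log(ratios[sigma]):.2f}")
```

For σ = 0.2 the ratio is about 23, close to the number ≈ 25 quoted in the text; halving σ doubles the ratio, which agrees with the statement that the limit value of I_xy and the optimal number of samples grow as the scattering width decreases.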



From the calculated mutual and marginal information,
the relative measures of determination D_{y|x} and D_{x|y} were
further determined using various statistical data sets. The
results are presented in Fig. 7 for the case of scattering
width σ = 0.2. When the number of data N surpasses the
interval around the optimal number N_o, statistical
variations of D_{y|x} and D_{x|y} become less pronounced and
their values settle close to the limit ones. The limit value
of D_{x|y} is substantially lower than that of D_{y|x}. This is
a consequence of the fact that in our case the variable y is
uniquely determined by the underlying law y(x) based upon the
variable x, but not vice versa. In our case, there are three
values of the variable x corresponding to a value of y in a
certain interval. Consequently, y is better determined by a
given x than vice versa, which further yields
D_{y|x} > D_{x|y}. Hence the relative measure of determination
indicates that variable x could be considered more
fundamental for the description of the relation between the
variables x and y.

[Figure 8 panel title: N=500, σ=0.2, random data]

Fig. 8. The joint PDF f(x,y) of N = 500 statistically
independent random data with σ = 0.2.

Fig. 9. Dependence of log(N), experimental information I_xy,
redundancy R_xy, and cost function C_xy on the number of
samples N determined by various statistical data sets and
scattering widths σ.

3.2 Data without a hidden law

To support the last conclusion let us examine an example
in which the sample values of the variables x and
y were calculated by two statistically independent random
generators. The corresponding joint PDF is shown
in Fig. 8, while the properties of the other statistics are
demonstrated by Figs. 9, 10 and 11.

Fig. 10. Dependence of log(N), experimental information I_xy,
marginal informations I_x, I_y, and mutual information I_m on
the number of samples N in the case of statistically
independent random variables x, y.

Fig. 11. Dependence of relative measure of determination D_{y|x}
(top lines) and D_{x|y} (bottom lines) on the number of random
samples N in the case of statistically independent random
data with σ = 0.2.

The properties of the presented statistics can be understood
if the overlapping of the scattering functions comprising
the estimator of the joint PDF is examined. In
the previous case with the underlying law y(x), the joint
data are distributed along the corresponding line where
-8 < x < +8, while in the last case they lie in
the square region -8 < x < +8, -8 < y < +8. Consequently,
the number of samples with nonoverlapping scattering
functions in the last case is approximately L/a = 16
times larger than in the previous case. In the last case
we can therefore expect the optimal number of samples
in the interval around N_ro ≈ 16 N_o = 480. Since in the
last case a larger region is covered by the joint PDF, the
overlapping of scattering functions is less probable than
previously, and therefore the joint experimental information
I_xy deviates less quickly from the line log(N) with
increasing N. Therefore, the redundancy increases less
quickly and the minimum of the cost function takes place
at a much higher number N_ro = 480, which corresponds
well to our estimation. Since in the last case the
experimental information I_xy converges less quickly to the
limit value than the marginal information I_x, I_y, the
mutual information I_m first increases and later decreases to
its limit value. Related to this is the approach of the
relative measures of determination D_{y|x}, D_{x|y} to much
lower limit values than in the previous case. Since the
marginal informations I_x, I_y are approximately equal, the
curves representing D_{y|x}, D_{x|y} join with increasing N,
and there is no argument to consider any variable as a more
fundamental one for the description of the phenomenon under
examination. This conclusion is consistent with the fact
that the centers of the scattering functions are determined
by two statistically independent random generators. However,
the limit values of the statistics D_{y|x}, D_{x|y} are not
equal to zero since the region -8 < x < +8, -8 < y < +8
where the data appear is limited, while the characteristic
region -σ < x < +σ, -σ < y < +σ covered by the joint
scattering function does not vanish.
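
The contrast between the two cases can be sketched numerically as follows. Following the abstract, the relative measure of determination is read here as the ratio of mutual to marginal information, D_{y|x} = I_m / I_y and D_{x|y} = I_m / I_x. This reading, the form I = H_f - H_ψ, the Gaussian scattering function, the uniform independent generators and the law in `law` are all assumptions of the sketch.

```python
import numpy as np

SIGMA = 0.2  # scattering width used in the text


def law(x):
    # Hypothetical stand-in for the unspecified law y(x).
    return 3.0 * np.sin(0.5 * x)


def entropy_mc(centers, sigma, rng, m=4000):
    # Monte Carlo estimate of H_f = -E_f[log f] for the Gaussian-kernel PDF f.
    n, d = centers.shape
    s = centers[rng.integers(0, n, m)] + rng.normal(0.0, sigma, (m, d))
    a = -0.5 * ((s[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) / sigma ** 2
    amax = a.max(axis=1)
    log_f = (amax + np.log(np.exp(a - amax[:, None]).sum(axis=1))
             - np.log(n) - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2))
    return -log_f.mean()


def information(centers, sigma=SIGMA, seed=0):
    # Assumed form: I = H_f - H_psi for a d-dimensional Gaussian psi.
    d = centers.shape[1]
    h_psi = 0.5 * d * (1.0 + np.log(2.0 * np.pi * sigma ** 2))
    return entropy_mc(centers, sigma, np.random.default_rng(seed)) - h_psi


def determination(x, y):
    # Assumed reading: D_{y|x} = I_m / I_y and D_{x|y} = I_m / I_x.
    i_x = information(x[:, None])
    i_y = information(y[:, None])
    i_m = i_x + i_y - information(np.column_stack([x, y]))
    return i_m / i_y, i_m / i_x


rng = np.random.default_rng(1)
# Case 1: data related by the hypothetical law, N = 100.
x1 = rng.uniform(-8.0, 8.0, 100) + rng.normal(0.0, SIGMA, 100)
y1 = law(x1) + rng.normal(0.0, SIGMA, 100)
# Case 2: independent data in the square -8 < x, y < +8, N = 500.
x2 = rng.uniform(-8.0, 8.0, 500)
y2 = rng.uniform(-8.0, 8.0, 500)

d_law = determination(x1, y1)
d_rnd = determination(x2, y2)
print("law data:    D_y|x = %.2f, D_x|y = %.2f" % d_law)
print("random data: D_y|x = %.2f, D_x|y = %.2f" % d_rnd)
```

In this sketch the two measures differ clearly for the law-related data and nearly coincide at a small but nonzero value for the independent data, which mirrors the qualitative behavior described for Figs. 7 and 11.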

4 Conclusions 

Following the procedures proposed in the previous article
[3], we have shown how the joint PDF of a vector variable
z = (x, y) can be estimated nonparametrically based upon
measured data. For this purpose the inaccuracy of joint
measurements was considered by including the scattering
function in the estimator. It is essential that the properties
of the scattering function need not be a priori specified,
but could be determined experimentally based upon a
calibration procedure. The joint PDF was then transformed
into the conditional PDF that provides for an extraction
of the law y(x) that relates the measured variables x, y.
For this purpose the estimation by the conditional average
E[y|x] is proposed. The quality of the prediction
by the conditional average is described in terms of the
estimation error and the variance of the measured data. It
is notable that the quality exhibits a convergence to
some limit value that represents the measure of applicability
of the proposed approach. Examination of the quality
convergence makes it feasible to estimate an appropriate
number of joint data needed for the modeling of the law.
It is important that the conditional average makes feasible
a nonparametric autonomous extraction of the underlying
law from the measured data.

Using the joint PDF estimator we have also defined 
the experimental information, the redundancy of measure- 
ment and the cost function of experimental exploration. It 
is characteristic that experimental information converges 
with an increasing number of joint samples to a certain 
limit value which characterizes the number of nonoverlap- 
ping scattering distributions in the estimator of the joint 
PDF. The most essential terms of the cost function are 
the experimental information and the redundancy. Dur- 
ing cost minimization the experimental information pro- 
vides for a proper adaptation of the joint PDF model to 
the experimental data, while the redundancy prevents an 
excessive growth of the number of experiments. By the 
position of the cost function minimum we introduced the 
optimal number of the data that is needed to represent the 
phenomenon under exploration. This number roughly cor- 
responds to the ratio between the magnitude of the charac- 
teristic region where joint data appear and the magnitude 
of the characteristic region covered by the joint scattering 
function. It also corresponds to the appropriate number 
estimated from the quality of prediction by the conditional 
average. Based upon the experimental information
corresponding to the joint and marginal PDFs, the mutual
information has been introduced and further utilized in the
definition of the relative measure of determination of one 
variable by another. This statistic provides an argument 
for considering one variable as a fundamental one for the 
description of the phenomenon. 

In this article we graphically present the properties of
the proposed statistics by two characteristic examples that
represent data related by a certain law and statistically
independent random data. The exhibited properties agree
well with the expectations given by experimental science.
The problems related to the extraction of laws representing
relations such as y^2 + x^2 = 1 and the expression of
physical laws by differential equations or analytical
modeling were not considered. For this purpose statistical
methods are being developed in the fields of pattern
recognition, system identification and artificial intelligence.